Take-home Exercise 2: Application of Spatial and Spatio-temporal Analysis Methods to Discover the Distribution of Dengue Fever in Tainan City, Taiwan

Overview

Setting The Scene

Dengue Hemorrhagic Fever (in short, dengue fever) is one of the most widespread mosquito-borne diseases in tropical and subtropical regions. It is an acute disease caused by dengue virus infection, which is transmitted by female Aedes aegypti and Aedes albopictus mosquitoes. In 2015, Taiwan recorded its most severe dengue fever outbreak, with more than 43,000 dengue cases and 228 deaths. Since then, annual reported dengue fever cases were maintained at not more than 200 cases. However, in 2023, Taiwan recorded 26,703 dengue fever cases. More than 25,000 of these were reported in Tainan City, and more than 80% of the reported cases occurred between August and November 2023, i.e. epidemiology weeks 31-50.

Objectives

As a curious geospatial analytics greenhorn, you are interested in discovering:

  • whether the distribution of the dengue fever outbreak in Tainan City, Taiwan is independent of space and space-time.
  • if the outbreak is indeed spatially and spatio-temporally dependent, where the clusters and outliers are, and where the emerging hot spot/cold spot areas are.

The Task

The specific tasks of this take-home exercise are as follows:

  • Using appropriate functions of sf and tidyverse, prepare the following geospatial data layers:
  1. a study area layer in sf polygon features. It must be at village level and confined to the D01, D02, D04, D06, D07, D08, D32 and D39 towns of Tainan City, Taiwan.
  2. a dengue fever layer within the study area in sf point features. The dengue fever cases should be confined to epidemiology weeks 31-50, 2023.
  3. a derived dengue fever layer in the spacetime S3 class of sfdep. It should contain, among other useful information, a data field showing the number of dengue fever cases by village and by epidemiology week.
  • Using the extracted data, perform global spatial autocorrelation analysis by using sfdep methods.
  • Using the extracted data, perform local spatial autocorrelation analysis by using sfdep methods.
  • Using the extracted data, perform emerging hot spot analysis by using sfdep methods.
  • Describe the spatial patterns revealed by the analyses above.

The Data

For the purpose of this take-home exercise, two data sets are provided:

TAIWAN_VILLAGE_2020, a geospatial data set of the village boundaries of Taiwan. It is in ESRI shapefile format, in the Taiwan Geographic Coordinate System, TWD97 longitude and latitude. (Source: historical map data of the village boundary)

Dengue_Daily.csv, an aspatial data set of reported dengue cases in Taiwan since 1998 (source: Dengue Daily Confirmed Cases Since 1998). Below are selected fields that are useful for this study:

  • 發病日: Onset date
  • 最小統計區中心點X: x-coordinate
  • 最小統計區中心點Y: y-coordinate

Both data sets have been uploaded to eLearn. Students are required to download them from eLearn.

Getting Started

Loading Packages

We can use the code chunk below to load the required packages.
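A minimal sketch of such a chunk (the exact package list is an assumption, based on the functions used later in this exercise):

```r
# Load all required packages in one call; pacman installs any that are missing
pacman::p_load(sf, sfdep, tmap, tidyverse, Kendall)
```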

Loading Data and Data Wrangling

Aspatial Data

We can load the Dengue_Daily data with the code chunk below. Since we are only studying cases from weeks 31-50, we can use filter() to keep the dates from 31 July to 17 December 2023, which correspond to weeks 31-50.
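A sketch of this chunk (the file path and the object name `dengue` are assumptions):

```r
# Read the raw CSV and keep only onset dates within epidemiology weeks 31-50
dengue <- read_csv("data/aspatial/Dengue_Daily.csv") %>%
  mutate(發病日 = as.Date(發病日)) %>%
  filter(發病日 >= as.Date("2023-07-31"),
         發病日 <= as.Date("2023-12-17"))
glimpse(dengue)
```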

Rows: 25,480
Columns: 26
$ 發病日             <date> 2023-07-31, 2023-07-31, 2023-07-31, 2023-07-31, 20…
$ 個案研判日         <chr> "2023/07/31", "2023/07/31", "2023/07/31", "2023/07/…
$ 通報日             <date> 2023-07-31, 2023-07-31, 2023-07-31, 2023-07-31, 20…
$ 性別               <chr> "女", "男", "男", "男", "男", "男", "女", "女", "男…
$ 年齡層             <chr> "30-34", "55-59", "5-9", "70+", "55-59", "30-34", "…
$ 居住縣市           <chr> "台南市", "台南市", "台南市", "台南市", "台南市", "…
$ 居住鄉鎮           <chr> "永康區", "東區", "永康區", "仁德區", "永康區", "古…
$ 居住村里           <chr> "埔園里", "大智里", "五王里", "成功里", "中興里", "…
$ 最小統計區         <chr> "A6731-0201-00", "A6732-1040-00", "A6731-0574-00", …
$ 最小統計區中心點X  <chr> "120.253752333", "120.232374917", "120.235733496", …
$ 最小統計區中心點Y  <chr> "23.031699814", "22.962366283", "23.013083716", "22…
$ 一級統計區         <chr> "A6731-16-006", "A6732-64-001", "A6731-42-008", "A6…
$ 二級統計區         <chr> "A6731-16", "A6732-64", "A6731-42", "A6727-10", "A6…
$ 感染縣市           <chr> "台南市", "台南市", "台南市", "台南市", "台南市", "…
$ 感染鄉鎮           <chr> "永康區", "東區", "永康區", "仁德區", "永康區", "古…
$ 感染村里           <chr> "None", "None", "None", "None", "None", "None", "No…
$ 是否境外移入       <chr> "否", "否", "否", "否", "否", "否", "否", "否", "否…
$ 感染國家           <chr> "中華民國", "中華民國", "中華民國", "中華民國", "中…
$ 確定病例數         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ 居住村里代碼       <chr> "6703100-004", "6703200-003", "6703100-001", "67027…
$ 感染村里代碼       <chr> "None", "None", "None", "None", "None", "None", "No…
$ 血清型             <chr> "None", "None", "None", "None", "第一型", "第一型",…
$ 內政部居住縣市代碼 <chr> "67", "67", "67", "67", "67", "10009", "67", "67", …
$ 內政部居住鄉鎮代碼 <chr> "6703100", "6703200", "6703100", "6702700", "670310…
$ 內政部感染縣市代碼 <chr> "67", "67", "67", "67", "67", "10009", "67", "67", …
$ 內政部感染鄉鎮代碼 <chr> "6703100", "6703200", "6703100", "6702700", "670310…

To save computational resources and make the data more readable, we keep only the three fields mentioned above and rename them in English.
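A sketch of this step (object names are assumptions):

```r
# Keep the three fields of interest and give them English names
dengue <- dengue %>%
  select(發病日, 最小統計區中心點X, 最小統計區中心點Y) %>%
  rename(OnsetDate    = 發病日,
         X_Coordinate = 最小統計區中心點X,
         Y_Coordinate = 最小統計區中心點Y)
glimpse(dengue)
```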

Rows: 25,480
Columns: 3
$ OnsetDate    <date> 2023-07-31, 2023-07-31, 2023-07-31, 2023-07-31, 2023-07-…
$ X_Coordinate <chr> "120.253752333", "120.232374917", "120.235733496", "120.2…
$ Y_Coordinate <chr> "23.031699814", "22.962366283", "23.013083716", "22.95747…

Since the coordinates are still in chr format, we need to convert them to numeric first.
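A sketch of the conversion; note that non-numeric strings such as "None" are coerced to NA, which explains the 14 NA's in the summary below:

```r
# Coerce coordinate strings to numeric; unparseable values become NA
dengue <- dengue %>%
  mutate(X_Coordinate = as.numeric(X_Coordinate),
         Y_Coordinate = as.numeric(Y_Coordinate))
summary(dengue)
```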

   OnsetDate           X_Coordinate    Y_Coordinate  
 Min.   :2023-07-31   Min.   :118.3   Min.   :21.99  
 1st Qu.:2023-09-11   1st Qu.:120.2   1st Qu.:22.97  
 Median :2023-10-01   Median :120.2   Median :22.99  
 Mean   :2023-10-03   Mean   :120.3   Mean   :23.01  
 3rd Qu.:2023-10-26   3rd Qu.:120.3   3rd Qu.:23.02  
 Max.   :2023-12-17   Max.   :121.9   Max.   :25.20  
                      NA's   :14      NA's   :14     

Since the data still has some missing values, let's clean that up first.
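One way to sketch the clean-up:

```r
# Drop the records with missing coordinates
dengue <- dengue %>%
  drop_na(X_Coordinate, Y_Coordinate)
summary(dengue)
```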

   OnsetDate           X_Coordinate    Y_Coordinate  
 Min.   :2023-07-31   Min.   :118.3   Min.   :21.99  
 1st Qu.:2023-09-11   1st Qu.:120.2   1st Qu.:22.97  
 Median :2023-10-01   Median :120.2   Median :22.99  
 Mean   :2023-10-03   Mean   :120.3   Mean   :23.01  
 3rd Qu.:2023-10-26   3rd Qu.:120.3   3rd Qu.:23.02  
 Max.   :2023-12-17   Max.   :121.9   Max.   :25.20  

After the data is clean, we can convert it into an sf object. Remember to set the CRS to TWD97 (EPSG:3824).
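A sketch of the conversion (the object name `dengue_sf` is an assumption):

```r
# Convert to sf point features in TWD97 geographic coordinates (EPSG:3824)
dengue_sf <- st_as_sf(dengue,
                      coords = c("X_Coordinate", "Y_Coordinate"),
                      crs = 3824)
glimpse(dengue_sf)
```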

Rows: 25,466
Columns: 2
$ OnsetDate <date> 2023-07-31, 2023-07-31, 2023-07-31, 2023-07-31, 2023-07-31,…
$ geometry  <POINT [°]> POINT (120.2538 23.0317), POINT (120.2324 22.96237), P…

Geospatial Data

We can use st_read() to load the geospatial data, with an additional filter to keep only the towns mentioned above. We can plot it to get a better understanding of the data.
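A sketch of this chunk, assuming the shapefile sits in a `data/geospatial` folder and filtering on the TOWNID field seen in the glimpse output:

```r
# Read the village shapefile and keep the eight study-area towns
tainan <- st_read(dsn = "data/geospatial",
                  layer = "TAINAN_VILLAGE") %>%
  filter(TOWNID %in% c("D01", "D02", "D04", "D06",
                       "D07", "D08", "D32", "D39"))
plot(st_geometry(tainan))
```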

Reading layer `TAINAN_VILLAGE' from data source 
  `C:\Users\yozaf\SMUY3S2\Geospatial\IS415-GAA\Take-home_Ex\Take-home_Ex02\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 649 features and 10 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 120.0269 ymin: 22.88751 xmax: 120.6563 ymax: 23.41374
Geodetic CRS:  TWD97
Rows: 258
Columns: 11
$ VILLCODE   <chr> "67000350032", "67000270011", "67000370005", "67000330004",…
$ COUNTYNAME <chr> "臺南市", "臺南市", "臺南市", "臺南市", "臺南市", "臺南市",…
$ TOWNNAME   <chr> "安南區", "仁德區", "中西區", "南區", "安南區", "安南區", "…
$ VILLNAME   <chr> "青草里", "保安里", "赤嵌里", "大成里", "城北里", "城南里",…
$ VILLENG    <chr> "Qingcao Vil.", "Bao'an Vil.", "Chihkan Vil.", "Dacheng Vil…
$ COUNTYID   <chr> "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D",…
$ COUNTYCODE <chr> "67000", "67000", "67000", "67000", "67000", "67000", "6700…
$ TOWNID     <chr> "D06", "D32", "D08", "D02", "D06", "D06", "D08", "D06", "D0…
$ TOWNCODE   <chr> "67000350", "67000270", "67000370", "67000330", "67000350",…
$ NOTE       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ geometry   <POLYGON [°]> POLYGON ((120.1176 23.08387..., POLYGON ((120.2304 …

Data Wrangling

Now we join the two data sets with st_join(). Here, we derive EpiWeek to group together cases in the same week. Don't forget to handle missing values as well.
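A sketch of the join, assuming epidemiology weeks are derived with lubridate's epiweek() (the object name `dengue_joined` is an assumption):

```r
# Point-in-polygon join, then derive the epidemiology week of each case
dengue_joined <- st_join(dengue_sf, tainan) %>%
  mutate(EpiWeek = lubridate::epiweek(OnsetDate)) %>%
  drop_na(VILLCODE)   # remove cases falling outside the study area
summary(dengue_joined)
```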

   OnsetDate                   geometry       VILLCODE        
 Min.   :2023-07-31   POINT        :18800   Length:18800      
 1st Qu.:2023-09-09   epsg:3824    :    0   Class :character  
 Median :2023-09-27   +proj=long...:    0   Mode  :character  
 Mean   :2023-09-28                                           
 3rd Qu.:2023-10-17                                           
 Max.   :2023-12-17                                           
  COUNTYNAME          TOWNNAME           VILLNAME           VILLENG         
 Length:18800       Length:18800       Length:18800       Length:18800      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   COUNTYID          COUNTYCODE           TOWNID            TOWNCODE        
 Length:18800       Length:18800       Length:18800       Length:18800      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    EpiWeek     
 Min.   :31.00  
 1st Qu.:36.00  
 Median :39.00  
 Mean   :39.06  
 3rd Qu.:42.00  
 Max.   :50.00  

After joining, we can make a new sf data frame to store a summary of how many cases appear in a specific village in a specific week.
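One way to sketch the aggregation (summarise() on an sf object unions the point geometries, which is why MULTIPOINT features appear in the summary below):

```r
# Count cases per village per epidemiology week
dengue_summary <- dengue_joined %>%
  group_by(VILLCODE, EpiWeek) %>%
  summarise(NumberOfCases = n()) %>%
  ungroup()
summary(dengue_summary)
```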

   VILLCODE            EpiWeek      NumberOfCases            geometry   
 Length:3514        Min.   :31.00   Min.   : 1.00   MULTIPOINT   :2523  
 Class :character   1st Qu.:36.00   1st Qu.: 1.00   POINT        : 991  
 Mode  :character   Median :40.00   Median : 3.00   epsg:3824    :   0  
                    Mean   :39.87   Mean   : 5.35   +proj=long...:   0  
                    3rd Qu.:44.00   3rd Qu.: 7.00                       
                    Max.   :50.00   Max.   :56.00                       

Now, to get the geometry of the villages, we can combine it with the tainan dataset. Before combining, we need to convert dengue_summary to a data frame by dropping the geometry column. Note, however, that this data frame does not contain rows for village-week combinations with no cases at all. To fix that, we can use complete() with the parameters in the chunk below: it adds a row for every possible combination of VILLCODE and EpiWeek, and the fill argument puts a 0 instead of NA in the NumberOfCases column where no cases were recorded.
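A sketch of this step (the object name `dengue_df` is an assumption):

```r
# Drop geometry, then fill in the missing village-week combinations with 0
dengue_df <- dengue_summary %>%
  st_drop_geometry() %>%
  complete(VILLCODE = unique(tainan$VILLCODE),
           EpiWeek  = 31:50,
           fill = list(NumberOfCases = 0))
```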

Then, we can use left_join() to join the data sets on the shared "VILLCODE" field, and select() to retrieve only the columns we need. Remember to turn the joined data back into an sf object.
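A sketch of the join (assuming the completed table is named `dengue_df`):

```r
# Attach village polygons and convert back to sf
dengue_summary_sf <- dengue_df %>%
  left_join(tainan, by = "VILLCODE") %>%
  select(VILLCODE, EpiWeek, NumberOfCases, geometry) %>%
  st_as_sf()
```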

Let's check the CRS of the new sf object to make sure that it's in the same reference system.
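The check itself is a one-liner:

```r
st_crs(dengue_summary_sf)
```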

Coordinate Reference System:
  User input: TWD97 
  wkt:
GEOGCRS["TWD97",
    DATUM["Taiwan Datum 1997",
        ELLIPSOID["GRS 1980",6378137,298.257222101,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["Taiwan, Republic of China - onshore and offshore - Taiwan Island, Penghu (Pescadores) Islands."],
        BBOX[17.36,114.32,26.96,123.61]],
    ID["EPSG",3824]]

Creating Spatiotemporal Layer

To create a spatio-temporal layer, we can use as_spacetime() on dengue_summary_sf, with VILLCODE as the location column and EpiWeek as the time column.
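A sketch of the call (the object name `dengue_st` is an assumption):

```r
# Build a spacetime object: VILLCODE is the location, EpiWeek the time
dengue_st <- as_spacetime(dengue_summary_sf,
                          .loc_col  = "VILLCODE",
                          .time_col = "EpiWeek")
```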

To see the data, we can use activate() with "data" as the second argument.
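For example (assuming the spacetime object is named `dengue_st`):

```r
dengue_st %>% activate("data")
```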

# A tibble: 5,140 × 3
   VILLCODE    EpiWeek NumberOfCases
 * <chr>         <dbl>         <int>
 1 67000270001      31             0
 2 67000270001      32             0
 3 67000270001      33             1
 4 67000270001      34             1
 5 67000270001      35             2
 6 67000270001      36             3
 7 67000270001      37             5
 8 67000270001      38             4
 9 67000270001      39             3
10 67000270001      40             2
# ℹ 5,130 more rows

The geometry values can likewise be seen with activate(), this time with "geometry" as the second argument.
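For example:

```r
dengue_st %>% activate("geometry")
```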

Simple feature collection with 257 features and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 120.0627 ymin: 22.89401 xmax: 120.2925 ymax: 23.09144
Geodetic CRS:  TWD97
# A tibble: 257 × 2
   VILLCODE                                                             geometry
 * <chr>                                                           <POLYGON [°]>
 1 67000270001 ((120.2672 22.99653, 120.2673 22.99616, 120.2675 22.9958, 120.26…
 2 67000270002 ((120.266 22.99104, 120.266 22.99096, 120.2659 22.99083, 120.265…
 3 67000270003 ((120.2492 22.98265, 120.2497 22.9826, 120.2504 22.98251, 120.25…
 4 67000270004 ((120.2391 22.98008, 120.2393 22.98001, 120.2394 22.97996, 120.2…
 5 67000270005 ((120.2578 22.97427, 120.2584 22.97398, 120.2591 22.97356, 120.2…
 6 67000270006 ((120.2713 22.96793, 120.2712 22.96777, 120.2712 22.96766, 120.2…
 7 67000270007 ((120.2408 22.959, 120.2422 22.95846, 120.2435 22.95791, 120.244…
 8 67000270008 ((120.2701 22.94837, 120.2701 22.94824, 120.27 22.94819, 120.27 …
 9 67000270011 ((120.2304 22.93544, 120.2306 22.93519, 120.2319 22.93271, 120.2…
10 67000270012 ((120.2251 22.96159, 120.2256 22.96149, 120.2261 22.9614, 120.22…
# ℹ 247 more rows

Before using the spacetime layer in further computations, we should check that it is a proper spacetime cube.
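sfdep provides a dedicated check for this:

```r
is_spacetime_cube(dengue_st)
```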

[1] TRUE

Visualizing Data

Now, we plot a choropleth with tmap to see the distribution of cases.
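One possible sketch, aggregating cases across all weeks before mapping (the aggregation, styling and object names are assumptions):

```r
# Total cases per village across all weeks, shown as a quantile choropleth
dengue_total <- dengue_summary_sf %>%
  group_by(VILLCODE) %>%
  summarise(TotalCases = sum(NumberOfCases))

tmap_mode("plot")
tm_shape(dengue_total) +
  tm_fill("TotalCases", style = "quantile", palette = "Reds") +
  tm_borders(alpha = 0.5)
```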

Global Measures of Spatial Autocorrelation

Computing Contiguity Spatial Weights

Before we can compute the global spatial autocorrelation statistics, we need to construct spatial weights for the study area. The code chunk below uses sfdep methods to derive contiguity weights.
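A sketch of the sfdep idiom; the neighbour summary below (5,140 regions) suggests the weights were derived on the full village-week layer (object name `wm_q` is an assumption):

```r
# Queen contiguity neighbours and row-standardised weights
wm_q <- dengue_summary_sf %>%
  mutate(nb = st_contiguity(geometry),
         wt = st_weights(nb, style = "W"),
         .before = 1)
summary(wm_q$nb)
```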

Neighbour list object:
Number of regions: 5140 
Number of nonzero links: 704060 
Percentage nonzero weights: 2.664915 
Average number of links: 136.9767 

Computing Global Moran's I

We can use global_moran() to compute the global Moran's I value. It returns a list containing the Moran's I value (I) and the sample kurtosis (K).
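A sketch of the call (assuming the weights object `wm_q` from the previous step):

```r
moranI <- global_moran(wm_q$NumberOfCases, wm_q$nb, wm_q$wt)
str(moranI)
```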

List of 2
 $ I: num 0.15
 $ K: num 17.1

Global Moran's I Test

We will use global_moran_test() to perform the Moran's I test.
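A sketch of the test:

```r
global_moran_test(wm_q$NumberOfCases, wm_q$nb, wm_q$wt)
```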


    Moran I test under randomisation

data:  x  
weights: listw    

Moran I statistic standard deviate = 87.954, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     1.497162e-01     -1.945904e-04      2.905083e-06 

Global Moran's I Permutation Test

In practice, Monte Carlo simulation (a permutation test) should be used to perform the test. For reproducibility, we set the seed at the start.
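A sketch; the output below reports "number of simulations + 1: 100", which implies nsim = 99 (the seed value itself is an assumption):

```r
set.seed(1234)  # seed value is an assumption
global_moran_perm(wm_q$NumberOfCases, wm_q$nb, wm_q$wt, nsim = 99)
```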


    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.14972, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided

The statistical report above shows that the p-value is smaller than the alpha value of 0.05. Hence, we have enough statistical evidence to reject the null hypothesis that the spatial distribution of cases resembles a random distribution (i.e. is spatially independent). Because the Moran's I statistic is greater than 0, we can infer that the spatial distribution shows signs of clustering.

Local Measures of Spatial Autocorrelation

Computing Local Moran's I

To compute local Moran's I, we can use local_moran() instead.
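A sketch of the sfdep idiom (the object name `lisa` follows the usage later in this post; nsim = 99 is an assumption):

```r
# Local Moran's I with permutations, unnested into ii, p_ii_sim, mean, ...
lisa <- wm_q %>%
  mutate(local_moran = local_moran(NumberOfCases, nb, wt, nsim = 99),
         .before = 1) %>%
  unnest(local_moran)
```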

Visualizing Local Moran's I

Next, tmap can be used to make a choropleth of the "ii" field (the local Moran's I statistic).
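For example (styling choices are assumptions):

```r
tm_shape(lisa) +
  tm_fill("ii") +
  tm_borders(alpha = 0.5)
```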

Visualizing p-values of Local Moran's I

tmap can also be used to plot the simulated p-values in the "p_ii_sim" column.
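For example (break points are assumptions):

```r
tm_shape(lisa) +
  tm_fill("p_ii_sim",
          breaks = c(0, 0.001, 0.01, 0.05, 1)) +
  tm_borders(alpha = 0.5)
```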

Visualizing LISA Map

A LISA map is a categorical map showing outliers and clusters. In the lisa sf data.frame, there are three fields containing the LISA categories: mean, median and pysal. In general, the classification in mean is used, as shown in the code chunk below.
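A sketch of such a chunk, restricting the classes to significant locations (the 0.05 cut-off is an assumption):

```r
# Plot significant locations only, coloured by their LISA class
lisa_sig <- lisa %>% filter(p_ii_sim < 0.05)

tm_shape(lisa) +
  tm_polygons() +
  tm_borders(alpha = 0.5) +
tm_shape(lisa_sig) +
  tm_fill("mean") +
  tm_borders(alpha = 0.4)
```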

Here is an explanation of the different categories:

  • High-High (HH): the location has a high value and is surrounded by neighbors with high values.
  • Low-Low (LL): the location has a low value and is surrounded by neighbors with low values.
  • Low-High (LH): the location has a low value but is surrounded by neighbors with high values.
  • High-Low (HL): the location has a high value but is surrounded by neighbors with low values.

Emerging Hot Spot Analysis

Emerging Hot Spot Analysis (EHSA) is a spatio-temporal analysis method for revealing and describing how hot spot and cold spot areas evolve over time.

Derive Spatial Weights

In EHSA, we will use the space-time cube we created before. First, we derive distance-based spatial weights using the code chunk below.
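A sketch of the sfdep idiom for deriving inverse-distance weights on a spacetime cube (object names and the scale/alpha parameters are assumptions):

```r
# Contiguity neighbours (including self) and inverse-distance weights,
# attached to the spacetime cube for every time slice
dengue_nb <- dengue_st %>%
  activate("geometry") %>%
  mutate(nb = include_self(st_contiguity(geometry)),
         wt = st_inverse_distance(nb, geometry,
                                  scale = 1, alpha = 1),
         .before = 1) %>%
  set_nbs("nb") %>%
  set_wts("wt")
```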

Computing Gi*

We can use the new neighbour and weight columns to manually calculate the local Gi* for each location. We do this by grouping by week and using local_gstar_perm() of the sfdep package, after which we use unnest() to unnest the gi_star column of the newly created gi_stars data.frame.
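A sketch of the computation (assuming the weighted cube from the previous step is named `dengue_nb`):

```r
# Gi* per village, computed week by week
gi_stars <- dengue_nb %>%
  group_by(EpiWeek) %>%
  mutate(gi_star = local_gstar_perm(NumberOfCases, nb, wt)) %>%
  tidyr::unnest(gi_star)
```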

Mann-Kendall Test

With these Gi* measures, we can perform a Mann-Kendall test to identify a monotonic trend. For this example, we look at VILLCODE 67000270001.

And plot the result using ggplot
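A sketch of both steps (the object name `cbg` follows its mention below; plot styling is an assumption):

```r
# Gi* series for one village, its trend plot, and the Mann-Kendall test
cbg <- gi_stars %>%
  ungroup() %>%
  filter(VILLCODE == "67000270001") %>%
  select(VILLCODE, EpiWeek, gi_star)

ggplot(cbg, aes(x = EpiWeek, y = gi_star)) +
  geom_line() +
  theme_light()

cbg %>%
  summarise(mk = list(unclass(Kendall::MannKendall(gi_star)))) %>%
  tidyr::unnest_wider(mk)
```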

# A tibble: 1 × 5
    tau    sl     S     D  varS
  <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.105 0.538    20  190.   950

The cbg result tells us that sl (the p-value) is 0.5376031, which is greater than 0.05. This means the slight upward trend (tau = 0.105) is not statistically significant.

Now we can use group_by() to perform the test for every location.
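A sketch, replicating the single-village test across all villages:

```r
# Mann-Kendall test on the Gi* series of every village
ehsa <- gi_stars %>%
  group_by(VILLCODE) %>%
  summarise(mk = list(unclass(Kendall::MannKendall(gi_star)))) %>%
  tidyr::unnest_wider(mk)
```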

Arranging To Show Significant Emerging Hot/Cold Spots

Use the code chunk below to arrange the results and show the significant emerging hot/cold spots.
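A sketch (the number of rows shown is an assumption):

```r
# Smallest p-values (sl) and strongest trends (|tau|) first
emerging <- ehsa %>%
  arrange(sl, abs(tau)) %>%
  slice(1:10)
```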

Performing Emerging HotSpot Analysis

Next, we can use emerging_hotspot_analysis() to perform EHSA. It takes a spacetime object x, the name of the variable of interest .var, the number of time lags k (set by default to 1), and nsim = 99 to indicate that 99 simulations will be run.
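A sketch of the call:

```r
ehsa <- emerging_hotspot_analysis(
  x    = dengue_st,
  .var = "NumberOfCases",
  k    = 1,
  nsim = 99
)
```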

Visualizing the Distribution of EHSA Classes

We can use ggplot to see the distribution of EHSA classes
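For example, as a simple bar chart of the classification column returned by EHSA:

```r
ggplot(ehsa, aes(x = classification)) +
  geom_bar()
```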

Visualizing EHSA

We will now visualize the geographic distribution of the EHSA classes. First, we join the two tables together with left_join().
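A sketch of the join; emerging_hotspot_analysis() stores the village codes in a column named location (the object name `tainan_ehsa` is an assumption):

```r
# EHSA results carry village codes in the "location" column
tainan_ehsa <- tainan %>%
  left_join(ehsa, by = c("VILLCODE" = "location"))
```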

Next, we can make a choropleth map
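A sketch of the map, assuming the joined table from the previous step is named `tainan_ehsa` and showing only significant classifications (the 0.05 cut-off is an assumption):

```r
# Show only villages whose EHSA classification is significant
ehsa_sig <- tainan_ehsa %>% filter(p_value < 0.05)

tm_shape(tainan_ehsa) +
  tm_polygons() +
  tm_borders(alpha = 0.5) +
tm_shape(ehsa_sig) +
  tm_fill("classification") +
  tm_borders(alpha = 0.4)
```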

References

  • https://sfdep.josiahparry.com/reference/global_moran_test.html
  • https://is415-gaa-tskam.netlify.app/in-class_ex/in-class_ex05/in-class_ex05-ehsa#do-it-yourself
  • https://sfdep.josiahparry.com/articles/spacetime-s3#spacetime-cubes